Since data manipulation is easier in pandas, and there are multiple iterative steps ahead, we shall stick to a disciplined workflow. Data PreProcessing and Data Standardization are the two main stages of data manipulation; after that, we only tune classifiers and evaluate. So for these first two stages, after we load the raw data_dict, we convert it to a pandas DataFrame, perform the operations for those stages, and then, before classifier tuning, convert it back to my_dataset as required by our starter code.
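To make this round trip concrete, here is a toy illustration with made-up names and values (not the actual Enron data):
import pandas as pd
toy_dict = {'PERSON A': {'salary': 100, 'poi': False},
            'PERSON B': {'salary': 200, 'poi': True}}
toy_df = pd.DataFrame.from_dict(toy_dict, orient='index') # keys become row labels
# ...pandas operations would happen here...
assert toy_df.to_dict('index') == toy_dict # the round trip is lossless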
from graphviz import Digraph
g = Digraph('Work Flow', node_attr={'shape': 'record'})
g.attr(rankdir='LR')
g.node('step_0', r'Dict To Df', style='filled', fillcolor='#BBDEFB')
g.node('step_1', r'1. Data\nPreProcessing',style='filled', fillcolor='#e6a26a')
g.node('step_2', r'2. Data\nStandardization', style='filled', fillcolor='#e6a26a')
g.node('step_2_5', r'Df to Dict', style='filled', fillcolor='#BBDEFB')
g.node('step_3', r'3. Classifier\nTuning', style='filled', fillcolor='#e6a26a')
g.node('step_4', r'4. Evaluation\nMetrics', style='filled', fillcolor='#e6a26a')
g.edges([('step_0','step_1'),('step_1','step_2'),('step_2','step_2_5'),('step_2_5','step_3'),('step_3','step_4')])
g
Jumping into our current code setup, we would thus start with something like this. Recall the functions from the earlier section; they have now been pushed to a library for brevity. When we define new functions, we will likewise add them to the library before reusing them later. Also note that we have switched to a decision tree classifier, as decided earlier.
from helpers_2 import init, split_data, tree_classify, evaluate
import pandas as pd
# raw data
data_dict = init()
# Dict to Df
df = pd.DataFrame.from_dict(data_dict, orient='index') # 'index' means keys should be rows
#-------------- PANDAS AREA --------------------
# 1. PRE PROCESSING
# 1.1. Cleaning Invalid Values ('NaN')
# 1.2. Remove Outliers
# 2. STANDARDIZATION
# 2.1. Feature Scaling
# 2.2. Feature Selection
features_list = ['poi','salary']
#-------------- PANDAS AREA --------------------
# Df to Dict
data_dict = df.to_dict('index') # 'index' means convert to dict like {index -> {column -> value}}
# CLASSIFIER
my_dataset = data_dict # just for Udacity's convention
labels, features = split_data(my_dataset, features_list)
clf = tree_classify(random_state=0) # random state 0 to maintain consistent results every time it's run
# EVALUATION
trained_clf = evaluate(clf, my_dataset, features_list)
1.1 Cleaning Invalid Values
Let us have a sneak peek into our raw data.
df.head()
Note that invalid values are marked with the string 'NaN'. Out of all the features, all except email_address and poi are numerical. For these numerical features, we learn, 'NaN' means not "not available" but 0. Before trying to spot outliers, we need to clean such invalid entries.
def clean_df(raw_df):
    """
    Replace 'NaN' with 0 in all feature columns, except 'poi' and 'email_address'.
    Return the updated df.
    """
    # replace 'NaN' except in poi and email_address (ref: https://stackoverflow.com/questions/42916989/replace-missing-values-in-all-columns-except-one-in-pandas-dataframe)
    data_panda_1 = raw_df.loc[:, raw_df.columns.difference(['poi','email_address'])].replace('NaN', 0, regex=True)
    # concat back the two columns excluded above
    data_panda_2 = raw_df[['poi','email_address']]
    # concat (remember axis=1)
    data_panda = pd.concat([data_panda_1, data_panda_2], axis=1, sort=False)
    # fix exponential display format caused by the replace operation (ref: https://stackoverflow.com/questions/42735541/customized-float-formatting-in-a-pandas-dataframe)
    pd.options.display.float_format = lambda x: '{:.0f}'.format(x) # we don't bother about decimal accuracy for the calculated means, stds, etc.
    return data_panda
test_df = clean_df(df)
test_df.head()
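As a quick sanity check (a minimal sketch on the cleaned frame above), we can confirm that no 'NaN' strings remain outside the two excluded columns:
numeric_cols = test_df.columns.difference(['poi', 'email_address'])
leftover = (test_df[numeric_cols] == 'NaN').sum().sum()
print(leftover) # expect 0; email_address may legitimately still hold 'NaN'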
Now, let us include this in our main workflow and check the scores.
from helpers_2 import init, split_data, tree_classify, evaluate
from helpers_2 import clean_df
import pandas as pd
# raw data
data_dict = init()
# Dict to Df
df = pd.DataFrame.from_dict(data_dict, orient='index') # 'index' means keys should be rows
#-------------- PANDAS AREA --------------------
# 1. PRE PROCESSING
# 1.1. Cleaning Invalid Values ('NaN')
df = clean_df(df)
# 1.2. Remove Outliers
# 2. STANDARDIZATION
# 2.1. Feature Scaling
# 2.2. Feature Selection
features_list = ['poi','salary']
#-------------- PANDAS AREA --------------------
# Df to Dict
data_dict = df.to_dict('index') # 'index' means convert to dict like {index -> {column -> value}}
# CLASSIFIER
my_dataset = data_dict # just for Udacity's convention
labels, features = split_data(my_dataset, features_list)
clf = tree_classify(random_state=0) # random state 0 to maintain consistent results every time it's run
# EVALUATION
trained_clf = evaluate(clf, my_dataset, features_list)
No difference. This only means sklearn had simply taken the 'NaN's as 0, or ignored them. Nevertheless, our df is now cleaner for spotting outliers.
1.2 Remove Outliers
Outliers could be of many types. For numerical features, some of them could be
- Those that lie too far out in either the positive or negative direction (can be spotted on a graph)
- Those that are out of place (e.g. a value that should always be positive, but the outlier is negative)
The 1st type is usually easy to detect by visualizing. However, some may require more than one iteration: an outlier may become visibly obvious only after removing another that was much farther out.
The 2nd type is usually harder to detect, as it needs knowledge about the features. For example, total_stock_value is supposed to always be positive, because it is the sum of stock values which themselves cannot be negative, so a negative value in this column could be an outlier. If it is very negative it might show up in the visuals; if not, it might be difficult to spot.
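For this particular suspect, a quick programmatic probe is also possible (a minimal sketch on the cleaned df; we still walk through the iterations below):
# any row with a negative total_stock_value is suspicious by definition
suspects = df[df['total_stock_value'] < 0]
print(suspects.index.tolist())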
Outlier Iteration 1
Let us start spotting by visualizing. For brevity, the visualizing functions are abstracted. The outliers are annotated as well (I had to work out upfront, in code, which points to annotate).
We try to spot the outliers of type 1 - those maxing out in the positive direction.
from helpers_2 import stripplots, get_feature_max
# test df for our analysis
test_df = df
# outliers in this iteration (this list can be built after a first pass of visualizing without annotations, picking the features from the graphs)
outlier_feature_list = [
'bonus', 'deferral_payments', 'director_fees','exercised_stock_options',
'expenses','from_messages', 'from_poi_to_this_person', 'loan_advances',
'long_term_incentive','other','restricted_stock','restricted_stock_deferred',
'salary','total_payments','total_stock_value'
]
outlier_dict = get_feature_max(test_df, outlier_feature_list)
# plot
stripplots(test_df, outlier_dict)
Let us find out who these outliers are, and remove them first.
from helpers_2 import get_index_label
outlier_list = []
for k, v in outlier_dict.items():
    outlier = get_index_label(test_df, k, v)
    outlier_list.append(outlier)
outlier_list = list(set(outlier_list)) # to remove duplicates
outlier_list
TOTAL obviously seems to be a data-entry error: the TOTAL row of the source spreadsheet was taken as a valid data point by mistake. The others are not POIs anyway, so we go ahead and remove them all.
test_df = test_df.drop(outlier_list)
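For reference, here are plausible sketches of the abstracted helpers used above (the actual helpers_2 implementations may differ):
def get_feature_max(input_df, feature_list):
    """Return {feature: max value of that feature}, for annotation."""
    return {f: input_df[f].max() for f in feature_list}
def get_feature_min(input_df, feature_list):
    """Return {feature: min value}; used below for negative-max outliers."""
    return {f: input_df[f].min() for f in feature_list}
def get_index_label(input_df, feature, value):
    """Return the index label (person name) whose feature equals value."""
    return input_df[input_df[feature] == value].index[0]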
Outlier Iteration 2
Now that we have removed some of the outliers, let us visualize again with the updated data frame. This time, we also inspect the negative-max outlier.
from helpers_2 import get_feature_min
# outliers in this iteration (this list can be built after a first pass of visualizing without annotations, picking the features from the graphs)
outlier_feature_list = ['deferral_payments', 'from_messages', 'long_term_incentive']
outlier_dict = get_feature_max(test_df, outlier_feature_list)
outlier_dict_min = get_feature_min(test_df, ['restricted_stock_deferred']) # negative max, as seen on the graph
outlier_dict.update(outlier_dict_min)
# plot
stripplots(test_df, outlier_dict)
outlier_list = []
for k, v in outlier_dict.items():
    outlier = get_index_label(test_df, k, v)
    outlier_list.append(outlier)
#outlier_list = list(set(outlier_list)) # to remove duplicates
outlier_list
test_df = test_df.drop(outlier_list)
Outlier Iteration 3
We have almost got rid of the type 1 outliers. Type 2 was more difficult (I found this one only by referencing how others approached it online). To spot type 2, let us use pandas' describe() feature instead of visuals.
description = test_df.describe().loc[['min','max','75%','mean']]
print(description)
Most of the max values found above, even those far beyond the 75th percentile, could still be ignored, because we already analyzed them visually and they belong to POIs; no non-POI stands out much.
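(If we wanted a programmatic shortlist instead of eyeballing, a rough heuristic could flag features whose max dwarfs the 75th percentile; the factor of 10 below is an arbitrary choice, just a sketch.)
desc = test_df.describe()
suspicious = [c for c in desc.columns
              if abs(desc.loc['max', c]) > 10 * abs(desc.loc['75%', c])]
print(suspicious)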
Let us check if total_payments and total_stock_value are properly calculated. Each is the sum of its individual related feature columns, as shown below.
Checking total payments
total_payments_components = ['salary', 'bonus', 'long_term_incentive', 'deferred_income', 'deferral_payments','loan_advances','other','expenses','director_fees']
# calculate the sum of the individual components and store it in 'total_check'
total_df = test_df[total_payments_components]
total_df = total_df.assign(total_check=test_df[total_payments_components].sum(axis=1)) # thanks: https://stackoverflow.com/questions/33750326/compute-row-average-in-pandas
# compare
import numpy as np
total_df['result'] = np.where( (test_df['total_payments'] == total_df['total_check']) ,'correct','wrong')
total_df['result'].value_counts()
print(get_index_label(total_df, 'result', 'wrong'))
Checking total stocks
total_stock_components = ['exercised_stock_options','restricted_stock','restricted_stock_deferred']
# calculate the sum of the individual components and store it in 'total_check'
total_df = test_df[total_stock_components]
total_df = total_df.assign(total_check=test_df[total_stock_components].sum(axis=1)) # thanks: https://stackoverflow.com/questions/33750326/compute-row-average-in-pandas
# compare
import numpy as np
total_df['result'] = np.where( (test_df['total_stock_value'] == total_df['total_check']) ,'correct','wrong')
total_df['result'].value_counts()
print(get_index_label(total_df, 'result', 'wrong'))
So we have a type 2 outlier here. Let us check if he is a POI.
def is_POI(input_df, index_label):
    """
    Given a data frame and an index label, return whether that person is a POI (True) or not.
    """
    return input_df.loc[index_label, 'poi']
is_POI(test_df, 'BELFER ROBERT')
He is not a POI, so we can remove him as well.
outlier_list = ['BELFER ROBERT']
test_df = test_df.drop(outlier_list)
Outlier Summary
Through 3 iterations (two visual, one computational), we have found a few outliers, which we now remove before checking our scores.
def remove_outliers(input_df, outlier_list):
    return input_df.drop(outlier_list)
from helpers_2 import init, split_data, tree_classify, evaluate
from helpers_2 import clean_df
import pandas as pd
# raw data
data_dict = init()
# Dict to Df
df = pd.DataFrame.from_dict(data_dict, orient='index') # 'index' means keys should be rows
#-------------- PANDAS AREA --------------------
# 1. PRE PROCESSING
# 1.1. Cleaning Invalid Values ('NaN')
df = clean_df(df)
# 1.2. Remove Outliers
outlier_list = ['LAVORATO JOHN J', 'TOTAL', 'KAMINSKI WINCENTY J', 'BHATNAGAR SANJAY'] # iteration 1
outlier_list += ['KEAN STEVEN J', 'DERRICK JR. JAMES V', 'FREVERT MARK A', 'MARTIN AMANDA K'] # iteration 2
outlier_list += ['BELFER ROBERT'] # iteration 3
df = remove_outliers(df, outlier_list)
# 2. STANDARDIZATION
# 2.1. Feature Scaling
# 2.2. Feature Selection
features_list = ['poi','salary']
#-------------- PANDAS AREA --------------------
# Df to Dict
data_dict = df.to_dict('index') # 'index' means convert to dict like {index -> {column -> value}}
# CLASSIFIER
my_dataset = data_dict # just for Udacity's convention
labels, features = split_data(my_dataset, features_list)
clf = tree_classify(random_state=0) # random state 0 to maintain consistent results every time it's run
# EVALUATION
trained_clf = evaluate(clf, my_dataset, features_list)
Though accuracy improved only slightly, precision and recall fared well compared to the scores before outlier removal.
Note that we still use only one feature, salary, for classification, which could be the reason for the low accuracy. In the next section we will explore adding more features and scaling them.
So far,
- We cleaned the data (replaced 'NaN' with 0)
- We removed a few outliers
This nearly concludes our Data PreProcessing stage.
from graphviz import Digraph
g = Digraph('Work Flow', node_attr={'shape': 'record'})
g.attr(rankdir='LR')
g.node('step_0', r'Dict To Df', style='filled', fillcolor='#BBDEFB')
g.node('step_1', r'1. Data\nPreProcessing',style='filled', fillcolor='#4be00b')
g.node('step_2', r'2. Data\nStandardization', style='filled', fillcolor='#e6a26a')
g.node('step_2_5', r'Df to Dict', style='filled', fillcolor='#BBDEFB')
g.node('step_3', r'3. Classifier\nTuning', style='filled', fillcolor='#e6a26a')
g.node('step_4', r'4. Evaluation\nMetrics', style='filled', fillcolor='#e6a26a')
g.edges([('step_0','step_1'),('step_1','step_2'),('step_2','step_2_5'),('step_2_5','step_3'),('step_3','step_4')])
g